D654-D659 Nucleic Acids Research, 2014, Vol. 42, Database issue 
doi:10.1093/nar/gkt1048 



Published online 7 November 2013 



DOOR 2.0: presenting operons and their functions 
through dynamic and integrated views 

Xizeng Mao, Qin Ma, Chuan Zhou, Xin Chen, Hanyuan Zhang, Jincai Yang, 
Fenglou Mao, Wei Lai and Ying Xu 

Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute 
of Bioinformatics, University of Georgia, Athens, GA 30602, USA, BioEnergy Science Center (BESC), Oak Ridge 
National Laboratory, Oak Ridge, Tennessee 37831, USA, School of Mathematics, Shandong University, Jinan, 
Shandong 250100, China, College of Computer Science and Technology, Jilin University, Changchun, Jilin 
130012, China and College of Computer Science, Central China Normal University, Wuhan, Hubei 430079, China 

Received September 1, 2013; Revised October 10, 2013; Accepted October 11, 2013 



ABSTRACT 

We have recently developed a new version of the 
DOOR operon database, DOOR 2.0, which is avail- 
able online at http://csbl.bmb.uga.edu/DOOR/ and 
will be updated on a regular basis. DOOR 2.0 
contains genome-scale operons for 2072 prokary- 
otes with complete genomes, three times the 
number of genomes covered in the previous 
version published in 2009. DOOR 2.0 has a number 
of new features, compared with its previous version, 
including (i) more than 250000 transcription units, 
experimentally validated or computationally pre- 
dicted based on RNA-seq data, providing a 
dynamic functional view of the underlying operons; 
(ii) an integrated operon-centric data resource that 
provides not only operons for each covered genome 
but also their functional and regulatory information 
such as their cis-regulatory binding sites for 
transcription initiation and termination, gene expression 
levels estimated based on RNA-seq data and con- 
servation information across multiple genomes; (iii) 
a high-performance web service for online operon 
prediction on user-provided genomic sequences; 
(iv) an intuitive genome browser to support visual- 
ization of user-selected data; and (v) a keyword- 
based Google-like search engine for finding the 
needed information intuitively and rapidly in this 
database. 

INTRODUCTION 

Operons have been widely used as the basic transcriptional 
and functional units when studying higher-level functional 



systems in prokaryotes such as biochemical pathways, 
networks and regulation systems since the concept was 
proposed by French scientists Jacob and Monod in 1960 
(1). Although it has never been suggested by the two sci- 
entists in their original paper, computational prediction of 
operons often treats them as units that do not overlap with 
each other (2,3), as this greatly simplifies operon predic- 
tion on the genomic scale. For the past decade, an increas- 
ingly popular term being used is 'transcriptional units', 
which are experimentally identified 'operons' as defined 
by Jacob and Monod in 1960 and may have overlaps. 

The emergence of large-scale RNA-seq data for increas- 
ingly more prokaryotic organisms has made it possible 
to elucidate 'operons' in their full complexity, as a few 
genome-scale transcriptomic datasets collected under multiple 
conditions have been used to reveal the dynamic structures 
of the statically predicted operons under different experi- 
mental conditions (4). We envision that the need for elu- 
cidation of the condition-dependent transcriptional units 
(TUs) (4,5) will continue to increase, as increasingly more 
RNA-seq data become available. Throughout this article, 
we use operons to refer to static non-overlapping 'tran- 
scriptional units' while using TUs to refer to operons ac- 
cording to the original definition of Jacob and Monod, i.e. 
sequences of consecutive genes that each encode a single 
RNA molecule along with their own promoters and ter- 
minators. The typical relationship between operons and 
TUs is that TUs tend to be sub-units of operons, while 
in some cases, a TU may span more than one operon. 

As of now, a number of operon databases have been 
publicly deployed by different research groups, including 
RegulonDB (5), ODB (6), DBTBS (7), OperonDB (8), 
ProOpDB (9) and DOOR (10) that was developed by 
our laboratory. These databases differ in their coverage 
of the operon information, and only a few have TU data. 
For example, the current version of RegulonDB contains 
>800 unique TUs for Escherichia coli (5) and ODB has 
10000 TUs (11), both collected from the public domain. 
Most of these databases do not contain regulatory infor- 
mation for their operons such as transcription factor 
binding sites and transcription terminators. In addition, 
none of these database servers provide services for 
online operon prediction on user-provided genomic 
sequences; only ODB provides 4812 reference operons that 
can potentially be used to assist operon prediction. 

The new version of the DOOR database, DOOR 2.0, 
covers all the 2072 completely sequenced prokaryotic 
genomes in the NCBI genome database (as of April 
2012), which is three times the number of genomes 
covered in its previous version published in 2009. In 
addition, DOOR 2.0 has several new features, namely, 
(i) 254 685 TUs collected from public databases such as 
RegulonDB (5) and Palsson's dataset (4) or computation- 
ally predicted based on RNA-seq data; (ii) an integrated 
operon-centric data resource offering operons, their regu- 
latory binding sites for transcription initiation (TFBSs), 
transcription terminators, gene-expression levels estimated 
based on RNA-seq data and their conservation informa- 
tion across multiple genomes; (iii) a high-performance web 
service for operon prediction on user-provided genomic 
sequences, powered by a backend computer cluster with 
>150 computing nodes; (iv) an intuitive genome browser 
to support visualization of user-specified data in the 
database; and (v) a keyword-based Google-like search 
engine for finding the needed information in the 
database intuitively and rapidly. To the best of our know- 
ledge, DOOR 2.0 is the first web-based operon database 
that integrates all such capabilities. Together, it provides 
an easy-to-use environment for discovering new informa- 
tion and synthesizing new knowledge about operons, their 
function, regulation and evolution across all sequenced 
prokaryotes. The database can be accessed at 
http://csbl.bmb.uga.edu/DOOR/ and will be updated on a regular 
basis when new prokaryotic genomes are released. 



DATABASE UPDATE 

DOOR 2.0 contains operons for 2072 complete prokaryotic 
genomes that were downloaded from the NCBI 
Genome FTP server (April 2012). These genomes comprise 1939 
bacteria and 133 archaea, with 2205 chromosomes and 
1645 plasmids. We predicted 1 323 902 multi-gene 
operons using our prediction program (12), on average 
~583 such operons per chromosome and ~24 operons 
per plasmid, and 2 578 949 single-gene operons, as 



detailed in Table 1. All the operons are stored in a 
MySQL relational database on our server and can be 
accessed efficiently in several ways. A user can 
browse operons by organisms or chromosomes/plasmids, 
which are organized into a searchable HTML table under 
the 'Organisms' navigation menu. The operons for an 
organism can be downloaded through the 'Download' 
link on the 'listing operons' page. A user can search for 
individual operons by entering keyword(s) in the search box, 
which is located in the upper right corner of the web 
page (Figure 1A). The user can also specify more 
complex queries by using multiple keywords connected 
through Boolean operators just as in Google; details 
can be found in the online manual at the DOOR 
2.0 web server (Figure 1A). 

NEW FEATURES IN DOOR 2.0 

DOOR 2.0 consists of 254 685 TUs (1385 experimentally 
verified and 253 300 predicted) for 24 prokaryotic 
genomes, 6489 verified TFBSs for 203 prokaryotic genomes, 
3 456 718 Rho-independent terminators for 2072 genomes 
and 6 975 454 conserved operon pairs. Only 24 organisms 
have TU information because only these organisms have 
enough RNA-seq data for reliable TU prediction. We expect 
that this number will increase rapidly as more genome-scale 
RNA-seq data become available. 
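
The headline counts quoted here and in Table 1 can be cross-checked with a few lines of arithmetic. The Python sketch below is only a sanity check on numbers copied from the text; the variable names are ours, and reading the ~2.6 terminators-per-operon average (quoted in the terminator section) as an average over multi-gene operons is an assumption that happens to reproduce the quoted figure.

```python
# Counts copied from the text and from Table 1 of this article.
verified_tus, predicted_tus = 1385, 253_300
bacteria, archaea = 1939, 133
multi_gene_operons = 1_323_902
# Table 1, column "With terminators", rows Operon (M) and Operon (S).
terminators_multi, terminators_single = 1_493_272, 1_963_446

total_tus = verified_tus + predicted_tus
total_genomes = bacteria + archaea
total_terminators = terminators_multi + terminators_single

# Assumption: the per-operon average is taken over multi-gene operons only.
terminators_per_operon = total_terminators / multi_gene_operons

print(total_tus, total_genomes, total_terminators,
      round(terminators_per_operon, 1))
```

The three sums agree with the totals stated in the text (254 685 TUs, 2072 genomes and 3 456 718 terminators).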

The previous version of DOOR supported the following 
features: (i) an online operon database for 675 prokaryotic 
genomes, (ii) a menu-based interface for finding 
user-specified attributes of operons, (iii) a motif prediction 
service for user-specified operons and (iv) a Wiki page to 
facilitate communication between the users and the 
developers. DOOR 2.0 has kept all these features except for 
'operon search based on its number of genes' and the Wiki 
page, as the usage statistics of the past 4 years show that 
they have not been actively used. In 
addition, DOOR 2.0 has a number of new features, 
selected based on users' input as well as our expectation 
of what users of an operon database might need, 
drawn from our own research experience in comparative 
genome analyses. 

INTEGRATION OF TUs 

An operon may be transcribed into different TUs under 
different experimental conditions, which tend to be sub- 
operonic with their own promoters and/or terminators (13), 



Table 1. The key statistics of DOOR 2.0 

Category      Number of operons   With TUs   With TFBSs   With terminators   Number of conserved operons 
Species       2072                24         203          2072               N/A 
Chromosome    2205                24         224          2205               N/A 
Plasmid       1645                0          13           1645               N/A 
Operon (M)    1 323 902           254 685    4229         1 493 272          6 975 454 
Operon (S)    2 578 949           N/A        2260         1 963 446          N/A 

Operon (M), multi-gene operons; Operon (S), single-gene operons; N/A, not applicable. 
Figure 1. (A) A screenshot of a display window. (B) A display of TUs, with the red bars representing genes, the first row of the blue bars 
representing multi-gene operons and the following rows of blue bars being TUs under different conditions. (C) A display of validated or predicted 
transcription factor binding sites (the left bottom) and Rho-independent terminators (on the right); and (D) conserved operons. 



whereas in some cases a TU could be super-operonic, spanning 
at least two operons (4). The TUs can be derived 
through RNA-seq analysis. In addition, numerous TUs 
have been experimentally validated in various prokaryotes 
and stored in public databases (5). 
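
The relationship between a TU and the static operons it overlaps can be made concrete with a small Python sketch. This is an illustration only: the function name `classify_tu`, the gene identifiers and the representation of operons and TUs as tuples of gene names are our assumptions, not the data model used by DOOR 2.0.

```python
def classify_tu(tu_genes, operons):
    """Classify a TU relative to a set of non-overlapping operons.

    tu_genes: iterable of gene identifiers in the TU.
    operons:  list of tuples of gene identifiers (hypothetical layout).
    """
    tu = set(tu_genes)
    # Operons that share at least one gene with the TU.
    overlapping = [set(op) for op in operons if tu & set(op)]
    if len(overlapping) >= 2:
        return "super-operonic"   # spans at least two operons
    if len(overlapping) == 1:
        operon = overlapping[0]
        if tu == operon:
            return "identical"    # TU coincides with the operon
        if tu < operon:
            return "sub-operonic" # proper subset of one operon
        return "other"            # partly outside the listed operons
    return "unassigned"           # no listed operon shares a gene


# Hypothetical gene names, for illustration only.
operons = [("g1", "g2", "g3"), ("g4", "g5"), ("g6",)]
print(classify_tu(("g1", "g2"), operons))
print(classify_tu(("g3", "g4", "g5"), operons))
```

Here the first TU is a proper subset of one operon, whereas the second covers genes from two operons.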

We have collected 1385 experimentally validated TUs in 
E. coli from the RegulonDB database (5) and Palsson's 
dataset (4), with 941 and 842 from the first and the second 
dataset, respectively. In addition, we have predicted 
253 300 TUs for all 24 bacterial genomes with genome-scale 
RNA-seq data in the NCBI SRA database (release 
of March 2013) (14), using 119 RNA-seq datasets and our 
in-house program SeqTU (manuscript in preparation). 
SeqTU is a machine learning-based classifier for detecting 
boundaries between consecutive TUs on the same strand of 
a genome. 

All the TUs are stored in a relational database and can 
be retrieved and displayed through the genome browser 
(Figure 1B). A user can examine TUs within an operon 
using the 'operon' page. Like operons, each TU has its 
own gene list with their genomic coordinates, underlying 
RNA-seq data, and an accuracy score if the TU is 
predicted by SeqTU. These items are individually clickable 
for more detailed information. A user can examine 



individual TUs via the genome browser by double-clicking 
the relevant RNA-seq ID in the left panel of the 
browser; these TUs are not displayed by default. To 
help users examine the expression values of a gene 
of interest, DOOR 2.0 provides a BigWig XY plot for 
each underlying RNA-seq dataset (15), where a user can 
double-click on the relevant BigWig item for more 
detailed information. 



INTEGRATION OF TRANSCRIPTION REGULATORY 
ELEMENTS 

DOOR 2.0 provides experimentally verified TFBSs for 203 
organisms and predicted intrinsic transcriptional termin- 
ators for all 2072 organisms, which can be used to study 
transcriptional regulation of operons. 

We have collected 6489 verified TFBSs for 203 organisms 
from RegulonDB (for E. coli only) (5) and 
RegTransBase (for 202 organisms) (16). All the TFBSs 
for each operon, if available, are displayed in an HTML 
table on the operon page, and can be examined along the 
underlying chromosome through the genome browser. 
TFBSs are not shown by default when an operon is 
displayed, but a user can double-click on the relevant menu 
in the left panel of the genome browser to turn on this 
feature. Each TFBS displayed is clickable, through which 
a user can find more detailed information such as 
its name, genomic coordinates and the DNA sequence (see 
Figure 1C). 

DOOR 2.0 also provides a de novo TFBS prediction 
capability for user-selected operons using two programs: 
BoBro (17,18) and MEME (19,20). The 300-bp 
upstream sequences of the selected operons are 
automatically retrieved from the selected genomes, and the 
predicted TFBSs are displayed in an HTML table 
along with their coordinates, the P-value measuring the 
statistical significance of the prediction, the consensus 
sequence and a WebLogo (21) (see Figure 2). 
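
Retrieving the 300-bp upstream region of an operon depends on the strand of the operon. The Python sketch below shows one way to do this; it is not the DOOR 2.0 code. The function names, the 1-based inclusive coordinate convention and the treatment of the chromosome as linear (no wrap-around for circular replicons) are all assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_sequence(genome, start, end, strand, length=300):
    """Return up to `length` bases upstream of an operon.

    genome:     chromosome sequence as a string (treated as linear here).
    start, end: 1-based inclusive operon coordinates, start <= end (assumed).
    strand:     '+' or '-'.
    """
    if strand == "+":
        # Upstream lies before `start` on the forward strand.
        lo = max(0, start - 1 - length)
        return genome[lo:start - 1]
    if strand == "-":
        # Upstream lies after `end`; report it in the operon's orientation.
        hi = min(len(genome), end + length)
        return reverse_complement(genome[end:hi])
    raise ValueError("strand must be '+' or '-'")


# Toy 25-bp 'genome' with an operon at positions 11..15.
genome = "ACGTACGTAC" + "GGGGG" + "TTTTTCCCCC"
print(upstream_sequence(genome, 11, 15, "+", length=4))
print(upstream_sequence(genome, 11, 15, "-", length=7))
```

Near the ends of a replicon the function simply returns a shorter sequence; a real pipeline for circular chromosomes would need to wrap around instead.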

It is known that prokaryotes use two different mechanisms 
of transcription termination: Rho-independent 
(intrinsic) and Rho-dependent (22). Rho-dependent 
termination involves the binding of a Rho factor to the 
mRNA to destabilize the RNA-DNA interaction and stop 
transcription, whereas Rho-independent termination 
functions by creating an RNA hairpin loop that stops the RNA 
polymerase (23). Rho-independent terminators can be 
reliably predicted based on identification of the conserved 
RNA hairpin loop, whereas Rho-dependent terminators 
cannot yet be predicted reliably, owing to the lack of known 
signals associated with them. 
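
To illustrate why a hairpin signal makes intrinsic terminators computationally tractable, the following Python sketch scans a DNA string for a perfect stem-loop followed by a T-rich tract (the DNA equivalent of the U-rich tail). It is a deliberately naive toy, not the algorithm of any published predictor; the function name and all parameter values (stem length, loop range, tail length, minimum T count) are arbitrary assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_hairpin_terminators(dna, stem=6, min_loop=3, max_loop=8,
                             tail=8, min_t=6):
    """Return 0-based half-open (start, end) spans of candidate terminators.

    A candidate is a perfect stem of `stem` bp, a loop of min_loop..max_loop
    bases, the reverse-complementary stem arm, then `tail` bases of which at
    least `min_t` are T. All thresholds are illustrative assumptions.
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - (2 * stem + min_loop + tail) + 1):
        left = dna[i:i + stem]
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop          # start of the right stem arm
            end = j + stem + tail        # end of the T-rich tail
            if end > len(dna):
                break
            right = dna[j:j + stem]
            tail_seq = dna[j + stem:end]
            if right == revcomp(left) and tail_seq.count("T") >= min_t:
                hits.append((i, end))
                break                    # report the shortest loop only
    return hits


# Toy sequence: stem GCCGCC, loop TTCG, stem GGCGGC, then eight T's.
toy = "AAAC" + "GCCGCC" + "TTCG" + "GGCGGC" + "TTTTTTTT" + "ACGA"
print(find_hairpin_terminators(toy))
```

Real predictors score imperfect stems and hairpin stability rather than requiring exact complementarity, so this sketch would miss most genuine terminators.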

We have predicted 3 456 718 Rho-independent terminators 
for all the 2072 organisms using the TransTermHP program 
(23), the best terminator predictor in the public domain, 
with the default parameters. This gives on average ~2.6 
terminators per operon, suggesting alternative terminators 
for each operon. All the terminators for each operon 
can be displayed both in an HTML table on the operon 
page and through the genome browser (see Figure 1C). 



INTEGRATION OF CONSERVED OPERONS 
ACROSS BACTERIA 

We have included the orthologous relationships among 
multi-gene operons across different bacterial genomes. 
Such information can be used for studies of operon 
evolution, such as elucidation of the life cycle of an operon 
(24). For two operons a and b in genomes A and B, 
respectively, we define a 'similarity score' between them as 
follows: 



$$S(a, b) = \frac{|\mathrm{orth}(a, b)|}{\left(|G(a)| + |G(b)|\right)/2}$$



where G(a) and G(b) denote the component genes in a 
and b, respectively; orth(a,b) represents the orthologous 
gene pairs between a and b identified by our prediction 
program GOST (25) (see Figure 1D); and |X| denotes the 
number of elements in X. Intuitively, the score = 1 if and 
only if all genes in a and b are one-to-one mapped to 
orthologous gene pairs; and the score = 0 if and only if 
no orthologous genes between a and b are detected. 
Generally, the higher the score, the higher the percentage of 
genes in a and b that form orthologous pairs. We consider a pair 
of operons a and b as conserved if S(a,b) is at least 0.7. 
Using this cut-off, 6 975 454 conserved operon pairs have 
been identified among the 2072 genomes. For any specific 
operon, a user can retrieve its conserved operons across all 
the other 2071 genomes in DOOR 2.0 by selecting the 
relevant menu item on the browser. 
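
The similarity score and the 0.7 cut-off can be written out as a short Python sketch. The function names and the representation of orthologs as (gene in a, gene in b) tuples are our assumptions; in DOOR 2.0 the orthologous pairs come from GOST, which is not reproduced here, and the sketch assumes the supplied pairs are already one-to-one.

```python
def similarity_score(genes_a, genes_b, ortholog_pairs):
    """S(a, b) = |orth(a, b)| / ((|G(a)| + |G(b)|) / 2).

    genes_a, genes_b: gene identifiers of operons a and b.
    ortholog_pairs:   iterable of (gene_in_a, gene_in_b) tuples, assumed
                      to be one-to-one (as a GOST-like tool would supply).
    """
    a, b = set(genes_a), set(genes_b)
    if not a or not b:
        raise ValueError("both operons must contain at least one gene")
    # Keep only pairs whose members actually belong to the two operons.
    orth = {(x, y) for x, y in ortholog_pairs if x in a and y in b}
    return len(orth) / ((len(a) + len(b)) / 2)


def is_conserved(genes_a, genes_b, ortholog_pairs, cutoff=0.7):
    """Apply the cut-off used in the text (score of at least 0.7)."""
    return similarity_score(genes_a, genes_b, ortholog_pairs) >= cutoff


# Hypothetical operons: three genes in a, two genes in b, two ortholog pairs.
op_a = ["a1", "a2", "a3"]
op_b = ["b1", "b2"]
pairs = [("a1", "b1"), ("a2", "b2")]
print(similarity_score(op_a, op_b, pairs))
print(is_conserved(op_a, op_b, pairs))
```

With two orthologous pairs between a three-gene and a two-gene operon the score is 2/2.5 = 0.8, which passes the cut-off; two pairs between two three-gene operons would give 2/3 and fail it.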



A NEW WEB INTERFACE 

The web interface of DOOR 2.0 is completely redesigned 
compared with the previous version. The new features of 
the interface include (i) an intuitive genome browser based 
on JBrowse Genome Browser (http://jbrowse.org) (26) 
that supports visualization of all the aforementioned 
data types related to operons along with a scrollable 
and zoomable chromosome for each organism; (ii) a new 
keyword-based Google-like search engine implemented 
using the Sphinx Open Source Search Server 
(http://sphinxsearch.com), through which a user can enter one 
or a few keywords to search for operons that have the 
specified attributes, e.g. coli, lactose, NC_000913, and can 
also formulate the search key as a complex query with 
Boolean operators (see the online document on the DOOR 
2.0 web server for examples); and (iii) an intuitive Web 
2.0 HTML table (DataTables, https://datatables.net) that 
supports on-the-fly filtering, multi-column sorting, 
variable length pagination and asynchronous loading for 
large datasets. 



ONLINE OPERON PREDICTION 

DOOR 2.0 offers an intuitive high-performance web 
service for online operon prediction. A user can have 
operons predicted in a newly sequenced genome or any 
provided prokaryotic genome sequence by uploading 
three types of data to the server: the chromosomal 
DNA sequence (.fna format as used by the NCBI 
Genome FTP Server), the protein sequences (.faa format as 



Figure 2. A screenshot of motif search results for a user-selected operon. 
used by the NCBI Genome FTP Server) and the gene locations 
(.ptt format as used by the NCBI Genome FTP Server). All 
the submitted jobs are put automatically into a job queue 
and executed on a first-come, first-served basis on 
our computing cluster. Once the job is done, the user will 
be notified via email with links to the web pages 
containing the computational results. All the predicted operons 
are displayed in an intuitive HTML table and stored on 
the DOOR 2.0 server for half a year. 
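
Of the three uploads, the gene location file is the one that determines gene order and strand. The Python sketch below reads gene coordinates from text in the tab-delimited .ptt layout; the column names (`Location`, `Strand`, `Synonym`, `Product`) and the two leading description lines reflect our understanding of NCBI's legacy format and should be treated as assumptions, as should the function name `parse_ptt`. The sample rows are made up for illustration.

```python
import csv


def parse_ptt(text):
    """Parse gene locations from .ptt-style text (layout assumed, see above)."""
    lines = text.splitlines()
    # Skip free-text description lines until the column header row.
    header_idx = next(
        (i for i, line in enumerate(lines) if line.startswith("Location\t")),
        None,
    )
    if header_idx is None:
        raise ValueError("no 'Location' header row found")
    genes = []
    reader = csv.DictReader(lines[header_idx:], delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        # Locations are written as 'start..end' (1-based, inclusive).
        start, end = (int(x) for x in row["Location"].split(".."))
        genes.append({
            "start": start,
            "end": end,
            "strand": row["Strand"],
            "synonym": row["Synonym"],
            "product": row["Product"],
        })
    return genes


# Made-up sample in the assumed layout; not real annotation data.
COLUMNS = ["Location", "Strand", "Length", "PID", "Gene",
           "Synonym", "Code", "COG", "Product"]
SAMPLE_PTT = "\n".join([
    "Hypothetical organism chromosome, complete genome - 1..5000",
    "2 proteins",
    "\t".join(COLUMNS),
    "\t".join(["190..255", "+", "21", "1000001", "geneA", "x0001",
               "-", "-", "hypothetical protein"]),
    "\t".join(["337..2799", "-", "820", "1000002", "geneB", "x0002",
               "-", "-", "hypothetical protein"]),
])

for gene in parse_ptt(SAMPLE_PTT):
    print(gene["synonym"], gene["start"], gene["end"], gene["strand"])
```

Checking that such a file parses cleanly before upload is a cheap way to catch formatting problems that would otherwise only surface after the job leaves the queue.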

IMPLEMENTATION 

DOOR 2.0 is implemented as a web portal server with a 
multi-layer architecture. The presentation and the logic 
layers are implemented using Web 2.0 technology 
(HTML5, CSS3 and JavaScript along with the 
jQuery library) and the PHP server-side scripting language. 
All data are stored in an optimized MySQL relational 
database. The keyword-based search engine is 
implemented based on the Sphinx Open Source Search Server 
(http://sphinxsearch.com), and the genome browser is 
implemented based on JBrowse Genome Browser 
(http://jbrowse.org) (26) and integrated into DOOR 2.0 using 
the iframe (inline frame) HTML tag. The web server 
runs on a Red Hat Enterprise Linux 6 box (16 Intel 
Xeon CPUs with 2.4 GHz and 16 GB memory), and the 
automated operon prediction pipeline runs on the computing 
cluster server with >150 computing nodes (2 Intel Xeon 
CPUs with 3.06 GHz and 2.5 GB memory per node). 

CONCLUDING REMARKS 

Here we have presented a new version of the DOOR operon 
database, DOOR 2.0. Although the previous version has 
been widely used (with >120 citations since its 
publication in 2009), we feel that it is time to develop and 
deploy a new version of the database to include all the 
prokaryotic genomes sequenced in the past few years, 
the available TU information experimentally validated 
or computationally derivable from RNA-seq data, as 
well as regulatory signals for each operon, which can be 
predicted based on comparative genome analysis. To best 
facilitate data retrieval, analysis and integrated 
applications of these data, we have developed a highly intuitive 
genome browser to support the visualization of these data 
types. With the high quality of our predicted operons, 
along with their regulatory signals and evolutionary 
conservation information, we believe that the new version of 
DOOR will continue to serve as a main source of operon 
data for the microbial research community. 